Add HydraGNN distributed training recipe for AMD MI355X by ashwinma · Pull Request #32 · AMDResearch/ai4science-studio

ashwinma · 2026-05-19T06:50:23Z

Summary

- Adds `sbatch_train_amd.sh` for multi-node HydraGNN GFM training on AMD Instinct MI355X via SLURM + Apptainer (MPI-enabled ADIOS, no DDStore)
- Overhauls `build_overlay_amd.sh` to compile adios2 from source with MPI, pin the HydraGNN SHA, and use node-local scratch for build I/O
- Documents the full workflow in `recipes/train/README.md` with validated performance numbers from a 50-batch sanity test (1 node, 8 GPUs)

What's included

| File | Change |
| --- | --- |
| `examples/sbatch_train_amd.sh` | New: SLURM batch script with embedded rank script, RCCL multi-node env vars, monkey-patch for `avg_num_neighbors` |
| `examples/build_overlay_amd.sh` | Rewritten: MPI-enabled adios2, node-local build, vesin/e3nn deps, pinned SHA |
| `examples/run_train.sh` | Updated: now a Docker/single-GPU entrypoint; defers to sbatch for HPC |
| `recipes/train/README.md` | Complete rewrite: env var reference, launch diagram, validated metrics |
| `model.yaml` | Updated container image (rocm7.2.2), added `AI4S_SHARED_DIR` and training env vars |
| `.cursor/skills/ai4science-studio/SKILL.md` | Lessons: `--gpus-per-node` requirement, PMIx shared memory fix |
| `.cursor/skills/ai4science-material-science/SKILL.md` | HydraGNN multi-node training pattern |
| `.claude/commands/init-cluster.md` | Added `scratch_local`, RCCL `socket_ifname`, IB HCA discovery |

Validated on

- Cluster: Vultr Lux (MI355X gfx950)
- Config: 1 node, 8 GPUs, batch_size=200, fp64, ANI1x + Alexandria datasets
- Result: 50 batches in ~130 s (2.4 s/batch steady state), 7.5 GB peak VRAM, no RCCL errors

Key design decisions

- Rank script is embedded in the batch script (heredoc), not a separate repo file. It is generated at runtime to `$HG_OUTPUT_DIR/hydragnn-rank-<jobid>.sh` for debuggability.
- No DDStore (phase 1): each rank opens ADIOS files directly via the MPI communicator. DDStore is deferred to phase 2 for 32+ node scale.
- Monkey-patch `avg_num_neighbors`: injects a precomputed value (13.74) to skip the expensive full-dataset neighbor-degree scan at init.
- SLURM env vars are not explicitly passed: Apptainer inherits the host environment; only script-computed or defaulted vars use `--env`.
- Multi-node RCCL vars are gated on `NODES > 1`: single-node runs don't need IB/network config.
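
The heredoc and node-count gating described above could look roughly like the following. This is a minimal sketch, not the contents of `sbatch_train_amd.sh`: the rank script body, the `mktemp` fallback directory, and the default values `eth0` and `mlx5_0` are placeholders. `NCCL_SOCKET_IFNAME` and `NCCL_IB_HCA` are the standard NCCL/RCCL variable names, but the PR does not list exactly which variables the script sets.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fall back to placeholder values so the sketch also runs outside SLURM.
NODES="${SLURM_JOB_NUM_NODES:-1}"
JOBID="${SLURM_JOB_ID:-local}"
HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-$(mktemp -d)}"
mkdir -p "$HG_OUTPUT_DIR"

# Write the per-rank script at runtime so it can be inspected after the job.
RANK_SCRIPT="$HG_OUTPUT_DIR/hydragnn-rank-${JOBID}.sh"
cat > "$RANK_SCRIPT" <<'EOF'
#!/usr/bin/env bash
# Placeholder body; the real rank script launches HydraGNN training.
echo "rank ${SLURM_PROCID:-0} starting"
EOF
chmod +x "$RANK_SCRIPT"

# The only variable the PR says single-node runs need.
export HSA_NO_SCRATCH_RECLAIM=1

# Network settings are exported only when the job spans several nodes.
if [ "$NODES" -gt 1 ]; then
  export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-eth0}"   # placeholder default
  export NCCL_IB_HCA="${NCCL_IB_HCA:-mlx5_0}"               # placeholder default
fi

# Stand-in for the srun + apptainer launch of the generated script.
bash "$RANK_SCRIPT"
```

Quoting the heredoc delimiter (`'EOF'`) keeps `${SLURM_PROCID}` unexpanded in the generated file, so each rank resolves it at launch time rather than at submission time.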

Test plan

- 50-batch sanity test (1 node / 8 GPUs): passed
- Multi-node (2+ nodes) validation
- Full-epoch training run
- Overlay build from scratch on a fresh node

Adds sbatch_train_amd.sh for multi-node/multi-GPU HydraGNN GFM training using MPI-enabled ADIOS2 multi-dataset loading (no DDStore) on AMD Instinct MI355X via Apptainer.

Key design decisions validated on Lux cluster:
- Use --gpus-per-node=8 (not --gpus-per-task=1) to allow RCCL full topology discovery via KFD sysfs; per-task isolation causes "Could not read node" RCCL errors on MI355X
- Single-node only needs HSA_NO_SCRATCH_RECLAIM=1 (wiki-documented)
- Multi-node RCCL env vars (IB HCA, socket ifname, etc.) are parameterized for site-specific override
- Monkey-patch AdiosMultiDataset.avg_num_neighbors to skip expensive full-dataset degree scan at init

Also updates:
- build_overlay_amd.sh: MPI-enabled adios2 build from source with cmake
- model.yaml: add training recipe and AI4S_SHARED_DIR env var
- recipes/train/README.md: full runbook
- init-cluster: add scratch_local and RCCL network discovery
- Studio + material science skills: GPU visibility lesson, PMIx fix

Co-authored-by: Cursor <cursoragent@cursor.com>
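
To illustrate the `--gpus-per-node=8` decision: when every rank can see all eight GPUs, each rank has to pick its own device. The sketch below shows one common way to do that from SLURM's node-local rank. It is an assumption, not taken from the PR: `--ntasks-per-node=8` (one rank per GPU) and the helper `device_for_rank` are illustrative, and the actual script may select devices differently.

```shell
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8   # assumption: one MPI rank per GPU
#SBATCH --gpus-per-node=8     # every rank sees all 8 GPUs (not --gpus-per-task=1)

# Map a node-local rank to a GPU index (illustrative helper, not from the PR).
device_for_rank() {
  local local_rank="$1" gpus_per_node="$2"
  echo $(( local_rank % gpus_per_node ))
}

# SLURM_LOCALID is set by srun for each task; default to 0 outside SLURM.
LOCAL_RANK="${SLURM_LOCALID:-0}"
echo "local rank ${LOCAL_RANK} -> GPU $(device_for_rank "$LOCAL_RANK" 8)"
```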

50-batch sanity test passed on MI355X (1 node, 8 GPUs):
- 2.4 s/batch steady state, 130s total training time
- 7.5 GB peak allocated / 9.0 GB reserved per GPU
- No RCCL errors, all ranks converged
- Environment: fp64, batch_size=200, ANI1x+Alexandria datasets

Also adds HYDRAGNN_MAX_NUM_BATCH, HYDRAGNN_VALTEST, SCRATCH_LOCAL to the environment variable reference table.

Co-authored-by: Cursor <cursoragent@cursor.com>

Apptainer inherits the host process environment by default: SLURM_JOB_ID, SLURM_JOB_NUM_NODES, SLURM_PROCID, and SLURM_CPUS_PER_TASK are already set by srun's PMIx launcher in each rank's environment. The explicit --env lines for these were misleading because they implied the variables would not propagate otherwise.

Co-authored-by: Cursor <cursoragent@cursor.com>
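
A launch line following this rule might be assembled as shown below. This is a hedged sketch: the image, overlay, and rank script paths are hypothetical, and the default of 50 for `HYDRAGNN_MAX_NUM_BATCH` is a placeholder borrowed from the sanity test. Only `--rocm`, `--overlay`, and `--env` are used, all of which are existing `apptainer exec` options. The command is printed rather than executed so the sketch runs without SLURM or Apptainer installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical paths; substitute the site's image, overlay, and rank script.
IMAGE="${IMAGE:-hydragnn.sif}"
OVERLAY="${OVERLAY:-hydragnn-overlay.img}"
RANK_SCRIPT="${RANK_SCRIPT:-./hydragnn-rank.sh}"

# A defaulted value owned by the batch script, so it is passed explicitly.
HYDRAGNN_MAX_NUM_BATCH="${HYDRAGNN_MAX_NUM_BATCH:-50}"

cmd=(
  apptainer exec --rocm
  --overlay "${OVERLAY}:ro"
  --env "HYDRAGNN_MAX_NUM_BATCH=${HYDRAGNN_MAX_NUM_BATCH}"
  "$IMAGE" bash "$RANK_SCRIPT"
)

# SLURM_JOB_ID, SLURM_PROCID, and similar variables get no --env entry:
# they are already in each rank's environment and are inherited.
printf '%s ' srun "${cmd[@]}"
echo
```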

Previously, RCCL fell back to socket transport because the ANP plugin and libionic were not bind-mounted into the container (--rocm does not expose them). This adds the required bind-mounts and MPI ob1/tcp configuration for Pensando/ionic fabrics.

Key changes:
- Bind-mount librccl-anp.so and libionic.so.1 from host into container
- Add MPI ob1/tcp transport (ionic /31 subnets don't route for verbs)
- Read all cluster-specific values from .cluster-config.yaml (gitignored)
- Add network fields to .cluster-config.example.yaml
- Add inference recipe and convergence tracking tooling
- Remove all cluster-specific names from committed files

Validated: 2-node/16-GPU training with NCCL_DEBUG=INFO confirms RCCL-ANP plugin loaded, all 8 ionic HCAs active, GDRDMA channels established for cross-node GPU allreduce.

Co-authored-by: Cursor <cursoragent@cursor.com>
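
The bind-mount and transport changes could be expressed along these lines. This is a sketch under stated assumptions: the host library locations under `/usr/lib` are hypothetical (the PR reads site-specific values from `.cluster-config.yaml`), and `OMPI_MCA_pml` / `OMPI_MCA_btl` assume the MPI in use is Open MPI, where `ob1` is a PML component. The `self` BTL is added here because Open MPI normally needs it alongside `tcp`; the commit message mentions only ob1/tcp.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical host locations; real paths come from the site configuration.
ANP_LIB="${ANP_LIB:-/usr/lib/librccl-anp.so}"
IONIC_LIB="${IONIC_LIB:-/usr/lib/libionic.so.1}"

# Route MPI traffic over TCP (ob1 PML with the tcp and self BTLs).
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,self

# Bind each host library to the same path inside the container,
# since --rocm alone does not expose these two files.
bind_args=()
for lib in "$ANP_LIB" "$IONIC_LIB"; do
  bind_args+=(--bind "${lib}:${lib}")
done

# These arguments would be appended to the apptainer exec command line.
printf '%s\n' "${bind_args[@]}"
```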

ashwinma and others added 4 commits May 19, 2026 02:14

ashwinma merged commit c0dafba into main May 20, 2026
2 checks passed

ashwinma deleted the hydragnn-distributed-training branch May 20, 2026 21:10
